Linear Regression Lab

Andrea Sánchez-Tapia, Paulinha Lemos-Costa, Sara Mortara
2024-02-01

Simple linear models

Linear regressions in R assume a linear relationship between one or more predictor variables \(X_i\) and a response variable \(Y\).

\[y_i \sim \mathcal{N}(\mu_i, \sigma) \\ \mu_i = \beta_0 + \beta_1 x_i\]

The first element of this model describes how the response variable \(Y\) can be modeled as a normal random variable with mean \(\mu_i\) and standard deviation \(\sigma\).

The second element of this model adds the effect of the predictor variable \(X\) on \(Y\), based on the causal relationship \(X \rightarrow Y\). In this simple case, the relationship is linear and can be described by the equation of a line with intercept \(\beta_0\) and slope \(\beta_1\).

Generating the dataset

In the introductory lecture for linear models, we used an example dataset to fit a simple regression model.

The dataset itself was generated according to the specifications of the model above. We are going to recreate the dataset and some of the calculations presented during the class.

set.seed(4) # ensures the result is reproducible. Random number generation can vary between operating systems, so the datasets may differ in the classroom
x_i <- seq(1, 5, by = 0.5)
n <- length(x_i)
mu <- 1.2 + 3.5 * x_i
y_i <- rnorm(n = n, mean = mu, sd = 3)
lm_data <- data.frame(x_i,  mu, y_i, res = y_i - mu)

Create a plot for the dataset

plot(y_i ~ x_i)

As you can see, plotting the data suggests a linear relationship between these two variables.

Regression coefficients

Let’s calculate the coefficients for the regression line according to the Ordinary Least Squares method:

\[\widehat{\beta_1}=\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sum_{i=1}^{n}(x_i-\bar{x})^2}\]

and

\[ \widehat{\beta_0} = \bar{y} - \widehat{\beta_1}\bar{x} \]

x <- x_i
y <- y_i
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ^ 2)
beta1_hat
[1] 3.84557
# Since mu = beta0 + beta1(X): 
beta0_hat <- mean(y) - beta1_hat * mean(x)
beta0_hat
[1] 1.459434

Add a regression line with the values of the coefficients you calculated

plot(y ~ x)
abline(a = beta0_hat, b = beta1_hat, col = "red")

The sample is not the population

When we generated the dataset above, \(\mu\) was specified as:

mu <- 1.2 + 3.5 * x_i

This means \(\beta_0 = 1.2\) and \(\beta_1 = 3.5\). How do these values compare to the coefficients you just calculated?

Add a population line to the plot you previously created by completing the code below:

plot(y ~ x)
abline(a = beta0_hat, b = beta1_hat, col = "red")
abline(a = ..., b = ..., col = "blue")

See the code by clicking Show code:

Show code
plot(y ~ x)
abline(a = beta0_hat, b = beta1_hat, col = "red")
abline(a = 1.2, b = 3.5, col = "blue")

Fitting a linear model in R

Function lm() fits regression models using Ordinary Least Squares. The values of the coefficients can be found using function coef().

mod <- lm(y ~ x)
coef(mod)
(Intercept)           x 
   1.459434    3.845570 

Compare this result with the calculations you did previously.
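As a quick programmatic check, the lm() estimates should match the manual OLS calculation to numerical precision. The sketch below recreates the data from scratch so it runs on its own:

```r
# Recreate the lab data and verify that lm() reproduces the manual OLS estimates
set.seed(4)
x <- seq(1, 5, by = 0.5)
y <- rnorm(n = length(x), mean = 1.2 + 3.5 * x, sd = 3)

# Manual OLS estimates, as calculated above
beta1_hat <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x)) ^ 2)
beta0_hat <- mean(y) - beta1_hat * mean(x)

mod <- lm(y ~ x)
all.equal(unname(coef(mod)), c(beta0_hat, beta1_hat)) # TRUE
```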

Let’s calculate the total sum of squares, the residual sum of squares, and the regression sum of squares

\(SS_y\), \(RSS\), and \(SS_{Reg}\)

\[RSS = \sum_{i=1}^{n}(y_i-\hat{y}_i)^2 \]

\[SS_y = {\sum_{i=1}^{n}(y_i-\bar{y})^2}\]

\[SS_{Reg} = \sum_{i=1}^{n}(\hat{y}_i-\bar{y})^2\]

and the total variation decomposes as

\[SS_y = SS_{Reg} + RSS\]

Challenge: Given that you have already learned that the \(R^2\) of a model is the ratio between the regression sum of squares and the total sum of squares, calculate it and check your result against the summary of the model.

Show code
pred <- predict(mod)
RSS <- sum((y - pred) ^ 2)
SSY <- sum((y - mean(y)) ^ 2)
SSreg <- SSY - RSS

SSreg / SSY

summary(mod)
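Rather than comparing by eye, you can also confirm the match programmatically: summary.lm stores the coefficient of determination in its r.squared component. A self-contained sketch:

```r
# Recreate the data, fit the model, and compare the manual R^2
# with the value stored in the model summary
set.seed(4)
x <- seq(1, 5, by = 0.5)
y <- rnorm(n = length(x), mean = 1.2 + 3.5 * x, sd = 3)
mod <- lm(y ~ x)

pred <- predict(mod)
RSS <- sum((y - pred) ^ 2)         # residual sum of squares
SSY <- sum((y - mean(y)) ^ 2)      # total sum of squares
r2_manual <- (SSY - RSS) / SSY

all.equal(r2_manual, summary(mod)$r.squared) # TRUE
```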

Checking model assumptions

The assumptions of the linear regression model are:

  1. The linear model correctly describes the functional relationship between \(X\) and \(Y\);

  2. The variable \(X\) is measured without error, allowing us to attribute the error component entirely to the \(Y\) variable;

  3. \(Y\) values are independent with normally distributed errors;

  4. Variances are constant along the regression line.

The fourth assumption allows us to use a single constant \(\sigma\) for the variance. Non-constant variances can be recognized using diagnostic plots. Let’s use them.

d <- residuals(mod)
y_hat <- predict(mod)

plot(d ~ y_hat)
abline(h = 0)

If the linear model fits the data well, this residual plot should show a scatter of points that approximately follow a normal distribution and are uncorrelated with the fitted values. Points below the horizontal line are overestimated by the model, while points above it are underestimated.

To check the assumption of homoscedasticity, observe whether the average size of the residuals increases with \(X\) (i.e., heteroscedasticity).
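To see what a violation looks like, you can simulate a hypothetical dataset (not part of the lab data) in which the standard deviation grows with \(X\), and inspect its residual plot:

```r
# Simulate heteroscedastic data: the residual spread grows with x
set.seed(42)
x_het <- seq(1, 5, length.out = 50)
y_het <- rnorm(n = 50, mean = 1.2 + 3.5 * x_het, sd = 0.5 * x_het)
mod_het <- lm(y_het ~ x_het)

# The cloud of residuals fans out as the fitted values increase
plot(residuals(mod_het) ~ predict(mod_het))
abline(h = 0, lty = 2)
```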

Another visualization for model diagnostics is the set of plots produced by plotting the object containing the fitted model. There are four plots, and we will change the graphical window so we can see all of them at the same time. As Michael J. Crawley would say:

It takes experience to interpret these plots, but what you do not want to see is lots of structure or pattern in the plot.

par(mfrow = c(2, 2)) # sets a graphical window with 2 rows-2 columns
plot(mod)
par(mfrow = c(1, 1)) # setting back to the default 1-1

The first plot is basically the same as the last plot you created. As we saw there, it is a problem if the variance of the residuals increases as the fitted values get larger, which would violate our fourth assumption. The second plot is a normal quantile-quantile plot; it should be a straight line if the errors are normally distributed (our third assumption). The third plot is very similar to the first one, but the y-axis is on a different scale: all values are positive because of the square root transformation. The last plot indicates influential points, i.e., points that have the largest effect on the parameter estimates.
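The influence measures behind that last plot can also be extracted directly with cooks.distance(), which returns one value per observation; points with distances much larger than the rest deserve a closer look. A minimal sketch using the lab data:

```r
# Recreate the lab data and compute Cook's distances for the fitted model
set.seed(4)
x <- seq(1, 5, by = 0.5)
y <- rnorm(n = length(x), mean = 1.2 + 3.5 * x, sd = 3)
mod <- lm(y ~ x)

cd <- cooks.distance(mod)
round(cd, 3)   # one distance per observation
which.max(cd)  # index of the single most influential point
```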